Descriptive Statistics: Central Tendency and Dispersion
POLS 3312: Argument, Data, and Politics
2024-01-29
Example
“This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.”
  Class Sex    Age   Survived Freq
1 1st   Male   Child No          0
2 2nd   Male   Child No          0
3 3rd   Male   Child No         35
4 Crew  Male   Child No          0
5 1st   Female Child No          0
6 2nd   Female Child No          0
This data is formatted the way we typically like to use data: variables as columns and units of observation as rows.
BEWARE! Not all data is formatted this way! Sometimes you have to think “is this a variable or a unit of observation?”
For example, data is often presented with variables as rows and units of observation as columns. That’s the easy case.
Sometimes, we get data in a mixed format called wide format. For example, the following data on Scandinavian temperatures:
  country avgtemp.1994 avgtemp.1995 avgtemp.1996
1 Sweden             5            5            9
2 Denmark            9            9            4
3 Norway             8            5            8
It looks like the unit of observation is country and the variable is a combination of year and temperature.
If we look at it in the long format we are used to, it’s a little clearer:
  country year avgtemp
1 Sweden  1994       5
2 Denmark 1994       9
3 Norway  1994       8
4 Sweden  1995       5
5 Denmark 1995       9
6 Norway  1995       5
7 Sweden  1996       9
8 Denmark 1996       4
9 Norway  1996       8
The variable is average temperature.
The unit of observation is actually not country; it’s country-year. Sweden-1994 is one observation, Sweden-1995 is another, and Sweden-1996 is a third, each with its own temperature value.
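The wide-to-long reshaping described above can be sketched in a few lines. This is a minimal Python illustration, not the method used to build the slides (whose output looks like it came from R). The helper name wide_to_long and its prefix argument are made up for this example, and the numbers are copied from the wide table.

```python
# Wide format: one row per country, one column per year (values copied from the slide).
wide = [
    {"country": "Sweden", "avgtemp.1994": 5, "avgtemp.1995": 5, "avgtemp.1996": 9},
    {"country": "Denmark", "avgtemp.1994": 9, "avgtemp.1995": 9, "avgtemp.1996": 4},
    {"country": "Norway", "avgtemp.1994": 8, "avgtemp.1995": 5, "avgtemp.1996": 8},
]


def wide_to_long(rows, prefix="avgtemp."):
    """Turn one row per country into one row per country-year (hypothetical helper)."""
    year_columns = [name for name in rows[0] if name.startswith(prefix)]
    long_rows = []
    for column in year_columns:
        year = int(column[len(prefix):])  # "avgtemp.1994" -> 1994
        for row in rows:
            long_rows.append(
                {"country": row["country"], "year": year, "avgtemp": row[column]}
            )
    return long_rows


long_rows = wide_to_long(wide)
for observation in long_rows:
    print(observation["country"], observation["year"], observation["avgtemp"])
```

Each of the nine printed lines is one country-year observation, which is the unit of observation in the long table.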
Measures of central tendency give us a few numbers that summarize the center of a set of measurements:
Mean
Median
Mode
A. What is the mean of 1, 5, 7, 9, 10, 12, 18?
[1] 8.857143
B. What is the mean of 10, 20, 25, 30, 35, 40, 45, 50, 55?
[1] 34.44444
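To check the two means by hand, add up the values and divide by how many there are. A minimal Python sketch using only the standard library (the variable names are just for illustration):

```python
from math import isclose
from statistics import mean

a = [1, 5, 7, 9, 10, 12, 18]
b = [10, 20, 25, 30, 35, 40, 45, 50, 55]

# The mean is the sum of the values divided by the number of values.
mean_a = sum(a) / len(a)  # 62 / 7
mean_b = sum(b) / len(b)  # 310 / 9

print(round(mean_a, 6))  # 8.857143
print(round(mean_b, 5))  # 34.44444

# The standard library's mean() agrees with the hand calculation.
print(isclose(mean(a), mean_a), isclose(mean(b), mean_b))
```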
A. What is the median of 1, 5, 7, 9, 10, 12, 18?
[1] 9
B. What is the median of 10, 20, 25, 30, 35, 40, 45, 50, 55?
[1] 35
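The median is the middle value once the data are sorted. A minimal Python sketch; the helper middle_value is hypothetical and written out only to show the steps, and the last example with four values is made up to show what happens when there is no single middle value:

```python
from statistics import median


def middle_value(values):
    """Median by hand: sort, then take the middle (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


a = [1, 5, 7, 9, 10, 12, 18]
b = [10, 20, 25, 30, 35, 40, 45, 50, 55]

print(middle_value(a))  # 9
print(middle_value(b))  # 35
print(middle_value([1, 5, 7, 9]))  # 6.0, halfway between 5 and 7

# The standard library's median() gives the same answers.
print(median(a) == middle_value(a), median(b) == middle_value(b))
```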
In both of our examples, the mean and median were close but not the same. That isn’t always the case.
C. 1,2,3,4,4,5,6,7
Answer:
D. 10,20,30,30,40,40,40,50,50,60,70
Answer:
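Assuming sets C and D are meant as exercises on the mode (the slide does not say so explicitly), the most frequent value can be found by counting. A minimal Python sketch; the list of class labels at the end is a made-up example of a categorical variable:

```python
from collections import Counter
from statistics import multimode

c = [1, 2, 3, 4, 4, 5, 6, 7]
d = [10, 20, 30, 30, 40, 40, 40, 50, 50, 60, 70]

# Count how often each value occurs, then take the most frequent one.
mode_c, count_c = Counter(c).most_common(1)[0]
mode_d, count_d = Counter(d).most_common(1)[0]
print(mode_c, count_c)  # 4 2
print(mode_d, count_d)  # 40 3

# multimode() returns every value tied for most frequent, as a list.
print(multimode(c))  # [4]
print(multimode(d))  # [40]

# The mode also works for labels, where a mean or median would not be defined.
labels = ["Crew", "3rd", "Crew", "1st"]  # hypothetical categorical data
print(multimode(labels))  # ['Crew']
```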
The median is largely unaffected by outliers.
The mean gives the broader picture because every value, including the outliers, enters the calculation.
The mode is the only option for unordered categorical variables.
We will discuss types of variables more in an upcoming class.
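A quick way to see the difference between mean and median is to add an outlier. The sketch below reuses set A and replaces its largest value, 18, with 180. The altered list is hypothetical and only there to illustrate the point.

```python
from statistics import mean, median

original = [1, 5, 7, 9, 10, 12, 18]
with_outlier = [1, 5, 7, 9, 10, 12, 180]  # hypothetical: 18 replaced by 180

# The mean is pulled toward the outlier; the median does not move.
print("mean:  ", round(mean(original), 2), "->", mean(with_outlier))
print("median:", median(original), "->", median(with_outlier))
```

Here the mean jumps from about 8.86 to 32, while the median stays at 9.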
The three numbers are often different for the same sample or population.
Example:
[Figure: negatively skewed, normal, and positively skewed distributions]
            Assault UrbanPop
Alabama         236       58
Alaska          263       48
Arizona         294       80
Arkansas        190       50
California      276       91
Colorado        204       78
Connecticut     110       77
Delaware        238       72
Florida         335       80
Georgia         211       60
Hawaii           46       83
Idaho           120       54
Mean Assault Arrests per 100,000 population:
[1] 170.76
Mean Urban Population Percentage:
[1] 65.54
Median Assault Arrests:
[1] 159
Median Urban Population:
[1] 66
Side by side:
                   Mean   Median
Assault Arrests  170.76      159
Urban Population  65.54       66
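Comparing the mean with the median gives a rough hint about skew. The sketch below applies that rule of thumb to the summary values reported on the slide. The helper skew_hint is hypothetical, and the rule is only a heuristic that can fail for unusual distributions.

```python
def skew_hint(mean_value, median_value):
    """Rule-of-thumb comparison of mean and median (hypothetical helper)."""
    if mean_value > median_value:
        return "mean above median: suggests a longer right tail (positive skew)"
    if mean_value < median_value:
        return "mean below median: suggests a longer left tail (negative skew)"
    return "mean equals median: no skew suggested"


# Summary values copied from the slide.
print("Assault: ", skew_hint(170.76, 159))
print("UrbanPop:", skew_hint(65.54, 66))
```

For Urban Population the two numbers are so close that the distribution is probably near symmetric; the hint is weak there.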
Measures of dispersion typically look at how the data is scattered around the mean.
Let’s look at that visually.
First the mean of Assault
Then the mean of Urban Population
The problem is that, by the definition of the mean, the positive distances cancel out the negative ones, so a measure of dispersion built by simply adding up the distances from the mean would always be zero!
Suppose we had a very simple data set with only two observations - 5 and 15. The mean is 10. One is 5 above the mean and one is 5 below the mean.
Point 1
[1] 5
Point 2
[1] 15
The mean
[1] 10
Distance 1
[1] -5
Distance 2
[1] 5
So, we want our new measure total_variation to equal the sum of the distances, which would be 10. But when we add 5 plus -5 we get:
The variation is:
[1] 0
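The cancellation is not special to this two-point example. A minimal Python sketch that repeats the calculation and then tries it on set A from earlier (variable names are just for illustration):

```python
points = [5, 15]
center = sum(points) / len(points)  # 10.0
distances = [x - center for x in points]  # [-5.0, 5.0]
print(sum(distances))  # 0.0

# The same thing happens with any data set, up to floating-point rounding.
a = [1, 5, 7, 9, 10, 12, 18]
mean_a = sum(a) / len(a)
total = sum(x - mean_a for x in a)
print(abs(total) < 1e-9)  # True
```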
What is something we can do that turns a negative number into a positive number every time and leaves a positive number as a positive?
It’s also important that any effect it has on the actual size of the numbers is consistent between positive and negative numbers.
We can square the distances:
[1] 25
[1] 25
Total squared variation is:
[1] 50
Given that the actual average distance is exactly the same for both groups, does that make sense? Is it useful?
We want the average of the squared differences.
So, in its simplest form, our measure of variance is:
The variance:
[1] 25
This is actually the population variance for this simple data example.
The standard deviation, the square root of the variance, is:
[1] 5
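The whole chain from distances to standard deviation can be retraced step by step. A minimal Python sketch for the two-point example, with the standard library functions shown for comparison. The last line is an aside on the sample variance, which divides by n - 1 instead of n and so gives a different number.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

points = [5, 15]
center = sum(points) / len(points)  # 10.0

squared = [(x - center) ** 2 for x in points]  # [25.0, 25.0]
total_squared = sum(squared)  # 50.0
population_variance = total_squared / len(points)  # 25.0
standard_deviation = sqrt(population_variance)  # 5.0

print(total_squared, population_variance, standard_deviation)

# Standard library equivalents of the population versions.
print(pvariance(points), pstdev(points))

# The sample variance divides by n - 1, so here it is 50 rather than 25.
print(variance(points))
```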
Author: Tom Hanna
Website: tomhanna.me
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
POLS3312, Spring 2024, Instructor: Tom Hanna